Rand Score (Rand Index)#
rand_score (the Rand index) measures how similar two clusterings (partitions) of the same dataset are by looking at all pairs of samples.
A good mental model:
Each clustering turns the dataset into a binary relation: for every pair (i, j), are they in the same cluster or in different clusters? The Rand index is the accuracy of those pairwise decisions.
Quick import#
from sklearn.metrics import rand_score
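To make the pairwise-accuracy framing concrete before any optimization, here is a deliberately naive sketch (the helper name rand_score_bruteforce is mine, not part of scikit-learn) that loops over all pairs and counts agreements; it matches rand_score on small inputs.

```python
from itertools import combinations

import numpy as np
from sklearn.metrics import rand_score


def rand_score_bruteforce(y_true, y_pred):
    """Rand index as plain pairwise accuracy: O(n^2), for intuition only."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    agreements = 0
    pairs = 0
    for i, j in combinations(range(len(y_true)), 2):
        same_true = y_true[i] == y_true[j]
        same_pred = y_pred[i] == y_pred[j]
        agreements += same_true == same_pred  # counts TP and TN pairs
        pairs += 1
    return agreements / pairs


y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([1, 1, 0, 0, 0, 2, 3, 3])
print(rand_score_bruteforce(y_true, y_pred))  # 0.7857142857142857
print(rand_score(y_true, y_pred))             # 0.7857142857142857
```

The rest of this notebook replaces the explicit loop with contingency-table counting, but the two computations agree exactly.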
Learning goals#
Understand Rand index as pairwise agreement (TP/TN/FP/FN over pairs)
Implement rand_score from scratch in NumPy (efficiently, without an O(n^2) loop)
Visualize where the score comes from on a small example
See common pitfalls (why it can be inflated when there are many clusters)
Use Rand score to select a simple clustering hyperparameter
Prerequisites#
NumPy basics (shape, broadcasting, np.unique)
Combinatorics: number of pairs \(\binom{n}{2}\)
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots
from sklearn.metrics import rand_score
from sklearn.metrics.cluster import pair_confusion_matrix
from sklearn.metrics import adjusted_rand_score
pio.templates.default = "plotly_white"
pio.renderers.default = "notebook"
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)
1) Definition: Rand index as pairwise accuracy#
Let there be \(n\) samples and two labelings:
ground truth (or reference) labels: \(y\)
predicted clustering labels: \(\hat{y}\)
For every unordered pair \((i, j)\) with \(i < j\), define two same-cluster indicators:

\[ s^{\text{true}}_{ij} = \mathbb{1}[y_i = y_j], \qquad s^{\text{pred}}_{ij} = \mathbb{1}[\hat{y}_i = \hat{y}_j] \]
Now count pairs into the familiar 2×2 confusion matrix:
TP: same in y and same in \(\hat{y}\)
TN: different in y and different in \(\hat{y}\)
FP: different in y but same in \(\hat{y}\) (a merge error)
FN: same in y but different in \(\hat{y}\) (a split error)
Formally:

\[ \mathrm{TP} = \#\{(i, j) : y_i = y_j \text{ and } \hat{y}_i = \hat{y}_j\}, \qquad \mathrm{TN} = \#\{(i, j) : y_i \ne y_j \text{ and } \hat{y}_i \ne \hat{y}_j\} \]

(with FP and FN defined analogously). There are \(\binom{n}{2}\) total pairs, so the Rand index is:

\[ \mathrm{RI} = \frac{\mathrm{TP} + \mathrm{TN}}{\binom{n}{2}} \]
Note: scikit-learn’s pair_confusion_matrix counts ordered pairs (i, j) with i != j,
so its entries sum to n(n-1) and are exactly 2× the i<j counts above.
Key properties:
Label permutation invariance: only equality matters; relabeling clusters doesn’t change the score.
Symmetry: \(\mathrm{RI}(y, \hat{y}) = \mathrm{RI}(\hat{y}, y)\).
Range: \([0, 1]\). For n < 2, scikit-learn returns 1.0 by convention.
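These properties are quick to verify directly with scikit-learn; a minimal sketch (the example arrays here are mine):

```python
import numpy as np
from sklearn.metrics import rand_score

a = np.array([0, 0, 1, 1, 2])
b = np.array([1, 1, 0, 2, 2])

# Symmetry: swapping the two labelings gives the same score.
assert rand_score(a, b) == rand_score(b, a)

# Label permutation invariance: relabel b's clusters (1->5, 0->7, 2->9).
b_relabeled = np.array([5, 5, 7, 9, 9])
assert rand_score(a, b) == rand_score(a, b_relabeled)

# Convention for n < 2: there are no pairs to compare.
print(rand_score([0], [3]))  # 1.0
```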
# A tiny toy example (small n so we can visualize all pairs)
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([1, 1, 0, 0, 0, 2, 3, 3])
print('y_true:', y_true)
print('y_pred:', y_pred)
print('rand_score (sklearn):', rand_score(y_true, y_pred))
pcm_ord = pair_confusion_matrix(y_true, y_pred)
print('pair_confusion_matrix (sklearn, ordered pairs):')
print(pcm_ord)
print('pair confusion on i<j (unordered) is pcm_ord // 2:')
print(pcm_ord // 2)
y_true: [0 0 0 1 1 2 2 2]
y_pred: [1 1 0 0 0 2 3 3]
rand_score (sklearn): 0.7857142857142857
pair_confusion_matrix (sklearn, ordered pairs):
[[38 4]
[ 8 6]]
pair confusion on i<j (unordered) is pcm_ord // 2:
[[19 2]
[ 4 3]]
Visual intuition: every clustering is a pairwise matrix#
For a labeling \(y\), define an \(n \times n\) matrix:

\[ (A_y)_{ij} = \begin{cases} 1 & \text{if } y_i = y_j \\ 0 & \text{otherwise} \end{cases} \]
Blocks of 1s correspond to clusters.
Comparing \(A_y\) and \(A_{\hat{y}}\) tells you which pairs are treated the same.
Below we color each pair (i, j) as:
TP: same in both
TN: different in both
FP: different in true, same in pred
FN: same in true, different in pred
def pair_category_matrix(y_true, y_pred):
y_true = np.asarray(y_true)
y_pred = np.asarray(y_pred)
if y_true.shape != y_pred.shape:
raise ValueError('y_true and y_pred must have the same shape')
same_true = y_true[:, None] == y_true[None, :]
same_pred = y_pred[:, None] == y_pred[None, :]
# 0=TN, 1=FP, 2=FN, 3=TP
cat = np.zeros_like(same_true, dtype=int)
cat[(~same_true) & same_pred] = 1
cat[same_true & (~same_pred)] = 2
cat[same_true & same_pred] = 3
np.fill_diagonal(cat, -1) # diagonal isn't a pair
return cat
cat = pair_category_matrix(y_true, y_pred)
# Count only i<j pairs
upper_mask = np.triu(np.ones_like(cat, dtype=bool), k=1)
vals = cat[upper_mask]
counts = {
"TN (different/different)": int((vals == 0).sum()),
"FP (merge error)": int((vals == 1).sum()),
"FN (split error)": int((vals == 2).sum()),
"TP (same/same)": int((vals == 3).sum()),
}
axis_labels = [f"{i}<br>t={y_true[i]}, p={y_pred[i]}" for i in range(len(y_true))]
name_map = {
-1: "diag",
0: "TN (different/different)",
1: "FP (merge error)",
2: "FN (split error)",
3: "TP (same/same)",
}
cat_name = np.vectorize(name_map.get)(cat)
c_diag = "#ffffff"
c_tn = "#d9d9d9"
c_fp = "#ff7f0e"
c_fn = "#1f77b4"
c_tp = "#2ca02c"
# With zmin=-1, zmax=3, normalized boundaries between integers are:
# -0.5 -> 0.125, 0.5 -> 0.375, 1.5 -> 0.625, 2.5 -> 0.875
colorscale = [
[0.0, c_diag],
[0.1249, c_diag],
[0.125, c_tn],
[0.3749, c_tn],
[0.375, c_fp],
[0.6249, c_fp],
[0.625, c_fn],
[0.8749, c_fn],
[0.875, c_tp],
[1.0, c_tp],
]
fig = make_subplots(
rows=1,
cols=2,
column_widths=[0.72, 0.28],
subplot_titles=(
"Pair categories (upper triangle is unique)",
"Pair counts (i<j)",
),
)
fig.add_trace(
go.Heatmap(
z=cat,
x=axis_labels,
y=axis_labels,
zmin=-1,
zmax=3,
colorscale=colorscale,
showscale=False,
customdata=cat_name,
hovertemplate="%{y} vs %{x}<br>%{customdata}<extra></extra>",
),
row=1,
col=1,
)
fig.add_trace(
go.Bar(
x=list(counts.keys()),
y=list(counts.values()),
text=list(counts.values()),
textposition="outside",
marker_color=[c_tn, c_fp, c_fn, c_tp],
),
row=1,
col=2,
)
fig.update_layout(
title="Rand index works on pairs: agreements (TP/TN) vs disagreements (FP/FN)",
xaxis=dict(tickangle=0),
)
fig.update_xaxes(title_text="sample index (with labels)", row=1, col=1)
fig.update_yaxes(title_text="sample index (with labels)", row=1, col=1)
fig.update_xaxes(tickangle=45, row=1, col=2)
fig.update_yaxes(title_text="count", row=1, col=2)
fig.show()
2) Efficient computation (without looping over all pairs)#
A naive implementation checks all \(\binom{n}{2}\) pairs, which is \(O(n^2)\).
We can compute the same TP/FP/FN/TN counts using a contingency table.
Let \(n_{ij}\) be the number of samples that are in:
true cluster \(i\), and
predicted cluster \(j\).
This forms a matrix \(N = [n_{ij}]\).
Key combinatorial identity#
The number of unordered pairs inside a group of size \(m\) is:

\[ \binom{m}{2} = \frac{m(m-1)}{2} \]
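A two-line sanity check of this identity, enumerating the pairs explicitly with itertools.combinations:

```python
from itertools import combinations

# For each group size m, count the unordered pairs directly
# and compare to the closed form m(m-1)/2.
for m in range(10):
    n_pairs = len(list(combinations(range(m), 2)))
    assert n_pairs == m * (m - 1) // 2
print("C(m, 2) = m(m-1)/2 holds for m = 0..9")
```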
Counting pairs via the contingency table#
Pairs that are together in both partitions:

\[ \mathrm{TP} = \sum_{i,j} \binom{n_{ij}}{2} \]

Pairs that are together in true labels:

\[ S_{\text{true}} = \sum_i \binom{a_i}{2}, \qquad a_i = \sum_j n_{ij} \quad \text{(row sums)} \]

Pairs that are together in pred labels:

\[ S_{\text{pred}} = \sum_j \binom{b_j}{2}, \qquad b_j = \sum_i n_{ij} \quad \text{(column sums)} \]

Then:

\[ \mathrm{FN} = S_{\text{true}} - \mathrm{TP}, \qquad \mathrm{FP} = S_{\text{pred}} - \mathrm{TP}, \qquad \mathrm{TN} = \binom{n}{2} - \mathrm{TP} - \mathrm{FP} - \mathrm{FN} \]

Finally:

\[ \mathrm{RI} = \frac{\mathrm{TP} + \mathrm{TN}}{\binom{n}{2}} \]
This gives an \(O(n + k_y k_{\hat{y}})\) implementation, where \(k_y\) and \(k_{\hat{y}}\) are the numbers of clusters in the two labelings.
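Before wrapping this in functions, here is the full chain of numbers for the toy example from section 1: contingency table, row/column sums, and the resulting pair counts (variable names are mine; the values match the earlier printouts):

```python
import numpy as np

y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([1, 1, 0, 0, 0, 2, 3, 3])


def comb2(x):
    x = np.asarray(x, dtype=np.int64)
    return x * (x - 1) // 2


# Contingency table: rows = true clusters, cols = predicted clusters.
_, ti = np.unique(y_true, return_inverse=True)
_, pi = np.unique(y_pred, return_inverse=True)
cont = np.zeros((ti.max() + 1, pi.max() + 1), dtype=np.int64)
np.add.at(cont, (ti, pi), 1)
print(cont)
# [[1 2 0 0]
#  [2 0 0 0]
#  [0 0 1 2]]

n = y_true.size
tp = comb2(cont).sum()                      # 3:  pairs together in both
same_true = comb2(cont.sum(axis=1)).sum()   # 7:  pairs together in y_true
same_pred = comb2(cont.sum(axis=0)).sum()   # 5:  pairs together in y_pred
fn = same_true - tp                         # 4 split errors
fp = same_pred - tp                         # 2 merge errors
total = comb2(n)                            # 28 pairs
tn = total - tp - fp - fn                   # 19
print((tp + tn) / total)  # 0.7857142857142857
```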
def comb2(x):
'''Unordered pairs: C(m, 2) = m(m-1)/2.'''
x = np.asarray(x, dtype=np.int64)
return (x * (x - 1)) // 2
def pairs_ordered(x):
'''Ordered pairs with i!=j inside a group: m(m-1) = 2*C(m,2).'''
x = np.asarray(x, dtype=np.int64)
return x * (x - 1)
def _labels_to_integers(labels):
labels = np.asarray(labels)
_, inv = np.unique(labels, return_inverse=True)
return inv
def contingency_matrix_np(y_true, y_pred):
y_true = np.asarray(y_true)
y_pred = np.asarray(y_pred)
if y_true.shape != y_pred.shape:
raise ValueError('y_true and y_pred must have the same shape')
true_inv = _labels_to_integers(y_true)
pred_inv = _labels_to_integers(y_pred)
n_true = int(true_inv.max(initial=-1)) + 1
n_pred = int(pred_inv.max(initial=-1)) + 1
cont = np.zeros((n_true, n_pred), dtype=np.int64)
np.add.at(cont, (true_inv, pred_inv), 1)
return cont
def pair_confusion_matrix_np(y_true, y_pred):
'''Match sklearn.metrics.cluster.pair_confusion_matrix.
scikit-learn counts ordered pairs (i, j) with i != j, so totals sum to n*(n-1).
'''
y_true = np.asarray(y_true)
n = int(y_true.size)
if n < 2:
return np.array([[0, 0], [0, 0]], dtype=np.int64)
cont = contingency_matrix_np(y_true, y_pred)
tp = int(pairs_ordered(cont).sum())
same_true = int(pairs_ordered(cont.sum(axis=1)).sum())
same_pred = int(pairs_ordered(cont.sum(axis=0)).sum())
fn = same_true - tp
fp = same_pred - tp
total_pairs = n * (n - 1)
tn = total_pairs - tp - fp - fn
return np.array([[tn, fp], [fn, tp]], dtype=np.int64)
def rand_score_np(y_true, y_pred):
'''Rand index over unordered pairs (i<j), matching sklearn.metrics.rand_score.'''
y_true = np.asarray(y_true)
n = int(y_true.size)
if n < 2:
return 1.0
cont = contingency_matrix_np(y_true, y_pred)
tp = float(comb2(cont).sum())
same_true = float(comb2(cont.sum(axis=1)).sum())
same_pred = float(comb2(cont.sum(axis=0)).sum())
fn = same_true - tp
fp = same_pred - tp
total_pairs = float(comb2(n))
tn = total_pairs - tp - fp - fn
return (tp + tn) / total_pairs
# Sanity checks vs scikit-learn
for n in [0, 1, 5, 50]:
y_true = rng.integers(0, 4, size=n)
y_pred = rng.integers(0, 6, size=n)
ri_np = rand_score_np(y_true, y_pred)
ri_sk = rand_score(y_true, y_pred)
pcm_np = pair_confusion_matrix_np(y_true, y_pred)
pcm_sk = pair_confusion_matrix(y_true, y_pred)
print(f"n={n:>2} rand_score close? {np.isclose(ri_np, ri_sk)} pair_confusion close? {np.all(pcm_np == pcm_sk)}")
n= 0 rand_score close? True pair_confusion close? True
n= 1 rand_score close? True pair_confusion close? True
n= 5 rand_score close? True pair_confusion close? True
n=50 rand_score close? True pair_confusion close? True
3) Properties worth remembering#
3.1) Invariant to relabeling#
If you permute cluster IDs (e.g., swap labels 0 and 1), the Rand score does not change.
That’s essential for clustering evaluation: cluster IDs are arbitrary.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([10, 10, 11, 11, 12, 12])
perm = {10: 12, 11: 10, 12: 11}
y_pred_perm = np.vectorize(perm.get)(y_pred)
print('original:', rand_score_np(y_true, y_pred))
print('permuted:', rand_score_np(y_true, y_pred_perm))
original: 1.0
permuted: 1.0
3.2) Extremes and trivial solutions#
Some degenerate clusterings can get surprisingly high Rand scores.
Two extremes:
All-in-one cluster: predicts every pair is “same”.
All-singletons: predicts every pair is “different”.
Because real datasets often have far more different-cluster pairs than same-cluster pairs, the “all-singletons” solution can score high even though it’s useless.
def all_in_one(n):
return np.zeros(n, dtype=int)
def all_singletons(n):
return np.arange(n, dtype=int)
n = 200
# True labels: 4 balanced clusters
k_true = 4
sizes = np.full(k_true, n // k_true)
y_true = np.repeat(np.arange(k_true), sizes)
y_all1 = all_in_one(n)
y_single = all_singletons(n)
def summarize(y_pred, name):
pcm_ord = pair_confusion_matrix_np(y_true, y_pred)
ri = rand_score_np(y_true, y_pred)
return name, ri, pcm_ord
for name, ri, pcm_ord in [
summarize(y_true, 'perfect'),
summarize(y_all1, 'all-in-one'),
summarize(y_single, 'all-singletons'),
]:
tn, fp, fn, tp = pcm_ord.ravel()
print(f"{name:>14} RI={ri:.4f} TP={tp:,} TN={tn:,} FP={fp:,} FN={fn:,} (ordered pairs)")
perfect RI=1.0000 TP=9,800 TN=30,000 FP=0 FN=0 (ordered pairs)
all-in-one RI=0.2462 TP=9,800 TN=0 FP=30,000 FN=0 (ordered pairs)
all-singletons RI=0.7538 TP=0 TN=30,000 FP=0 FN=9,800 (ordered pairs)
4) Pitfall: Rand score is not adjusted for chance#
If you assign labels at random, the expected Rand score is not a constant baseline like 0.
In fact, when the predicted clustering has many clusters, most pairs become “different” in the prediction, which creates lots of TN agreements.
This is a big reason why practitioners often prefer:
Adjusted Rand index (adjusted_rand_score / ARI): corrected so random labelings score around 0.
We’ll simulate this effect.
n = 200
k_true = 4
sizes = np.full(k_true, n // k_true)
y_true = np.repeat(np.arange(k_true), sizes)
k_pred_values = [2, 4, 8, 16, 32, 64, 128]
reps = 200
rand_scores = {k: [] for k in k_pred_values}
adj_rand_scores = {k: [] for k in k_pred_values}
for k in k_pred_values:
for _ in range(reps):
y_pred = rng.integers(0, k, size=n)
rand_scores[k].append(rand_score_np(y_true, y_pred))
adj_rand_scores[k].append(adjusted_rand_score(y_true, y_pred))
fig = make_subplots(
rows=1,
cols=2,
subplot_titles=("Rand score (unadjusted)", "Adjusted Rand score (ARI)"),
)
for k in k_pred_values:
fig.add_trace(go.Box(y=rand_scores[k], name=f"k={k}", boxmean=True, showlegend=False), row=1, col=1)
fig.add_trace(go.Box(y=adj_rand_scores[k], name=f"k={k}", boxmean=True, showlegend=False), row=1, col=2)
fig.update_layout(title="Random predictions: Rand score inflates as #clusters increases")
fig.update_yaxes(title_text="score", row=1, col=1)
fig.update_yaxes(title_text="score", row=1, col=2)
fig.show()
5) Using Rand score to select a simple clustering model (k-means example)#
Rand score is an external clustering metric: it requires reference labels.
That makes it useful when:
you’re benchmarking clustering algorithms on labeled datasets, or
you’re tuning hyperparameters in a semi-supervised evaluation setting.
Below we:
create synthetic blobs with known cluster IDs
run a small NumPy k-means implementation for different k
pick k that maximizes Rand score
Important: in real unsupervised tasks you typically don’t have y_true, so you’d use internal metrics (e.g., silhouette) instead.
# Synthetic 2D blobs
centers = np.array([[-2.0, -2.0], [2.0, -2.0], [-2.0, 2.0], [2.0, 2.0]])
cluster_std = 0.55
n_per = 80
X_list = []
y_true = []
for k, c in enumerate(centers):
Xk = rng.normal(loc=c, scale=cluster_std, size=(n_per, 2))
X_list.append(Xk)
y_true.append(np.full(n_per, k, dtype=int))
X = np.vstack(X_list)
y_true = np.concatenate(y_true)
# Shuffle
perm = rng.permutation(len(X))
X = X[perm]
y_true = y_true[perm]
fig = px.scatter(
x=X[:, 0],
y=X[:, 1],
color=y_true.astype(str),
title="Synthetic data (ground truth clusters)",
labels={"x": "x1", "y": "x2", "color": "true cluster"},
)
fig.show()
def kmeans_np(X, k, *, n_init=10, max_iter=100, tol=1e-4, rng=None):
if rng is None:
rng = np.random.default_rng(0)
n, d = X.shape
best_inertia = np.inf
best_centroids = None
best_labels = None
for _ in range(n_init):
centroids = X[rng.choice(n, size=k, replace=False)].copy()
for _ in range(max_iter):
# squared distances: (n, k)
d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
labels = d2.argmin(axis=1)
new_centroids = np.empty_like(centroids)
for j in range(k):
mask = labels == j
if mask.any():
new_centroids[j] = X[mask].mean(axis=0)
else:
# handle empty clusters by re-seeding to a random point
new_centroids[j] = X[rng.integers(0, n)]
shift = np.linalg.norm(new_centroids - centroids)
centroids = new_centroids
if shift < tol:
break
inertia = float(((X - centroids[labels]) ** 2).sum())
if inertia < best_inertia:
best_inertia = inertia
best_centroids = centroids
best_labels = labels
return best_centroids, best_labels, best_inertia
k_values = list(range(2, 9))
results = []
for k in k_values:
centroids, labels, inertia = kmeans_np(X, k, n_init=15, max_iter=100, tol=1e-4, rng=rng)
ri = rand_score_np(y_true, labels)
ari = adjusted_rand_score(y_true, labels)
results.append((k, ri, ari, inertia))
results = np.array(results, dtype=float)
fig = make_subplots(rows=1, cols=2, subplot_titles=("Rand score vs k", "ARI vs k"))
fig.add_trace(go.Scatter(x=results[:, 0], y=results[:, 1], mode='lines+markers', name='RI'), row=1, col=1)
fig.add_trace(go.Scatter(x=results[:, 0], y=results[:, 2], mode='lines+markers', name='ARI'), row=1, col=2)
fig.update_xaxes(title_text="k", row=1, col=1)
fig.update_xaxes(title_text="k", row=1, col=2)
fig.update_yaxes(title_text="score", row=1, col=1)
fig.update_yaxes(title_text="score", row=1, col=2)
fig.update_layout(title="Selecting k with external labels (demo)")
fig.show()
best_k = int(results[results[:, 1].argmax(), 0])
print('best k by Rand score:', best_k)
centroids, labels_best, _ = kmeans_np(X, best_k, n_init=25, rng=rng)
fig = make_subplots(rows=1, cols=2, subplot_titles=("Ground truth", f"k-means prediction (k={best_k})"))
fig.add_trace(
go.Scatter(x=X[:, 0], y=X[:, 1], mode='markers', marker=dict(color=y_true, colorscale='Viridis', size=6), showlegend=False),
row=1,
col=1,
)
fig.add_trace(
go.Scatter(x=X[:, 0], y=X[:, 1], mode='markers', marker=dict(color=labels_best, colorscale='Viridis', size=6), showlegend=False),
row=1,
col=2,
)
fig.update_xaxes(title_text="x1", row=1, col=1)
fig.update_xaxes(title_text="x1", row=1, col=2)
fig.update_yaxes(title_text="x2", row=1, col=1)
fig.update_yaxes(title_text="x2", row=1, col=2)
fig.update_layout(title="Clustering visualization (labels are arbitrary, Rand index is invariant)")
fig.show()
best k by Rand score: 4
6) Pros, cons, and when to use it#
Pros#
Permutation invariant: cluster IDs don’t matter.
Interpretable: literally the fraction of pairwise agreements.
Works for any number of clusters and doesn’t require distances/geometry.
Fast to compute via contingency counts (no need to enumerate all pairs).
Cons / pitfalls#
Not adjusted for chance: random clusterings can get high RI, especially with many clusters.
Often dominated by TN (different/different) pairs, which can hide meaningful errors.
Can reward over-segmentation (too many clusters) in some settings.
Not differentiable: not suitable as a gradient-based training loss.
Good use cases#
External evaluation of clustering when you have reference labels (benchmarks, ablations).
Comparing two clusterings of the same dataset when you care about pairwise co-membership.
If you need a chance-corrected baseline, prefer ARI (adjusted_rand_score).
7) Practical usage (scikit-learn)#
from sklearn.metrics import rand_score, adjusted_rand_score
ri = rand_score(y_true, y_pred)
ari = adjusted_rand_score(y_true, y_pred)
rand_score is a thin wrapper around the pair confusion matrix.
If you need debugging insight, inspect:
from sklearn.metrics.cluster import pair_confusion_matrix
pair_confusion_matrix(y_true, y_pred)
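If you already have the pair confusion matrix, the Rand score is just the fraction of agreeing pairs in it; a quick way to confirm the relationship (the matrix counts ordered pairs, but the factor of 2 cancels in the ratio):

```python
import numpy as np
from sklearn.metrics import rand_score
from sklearn.metrics.cluster import pair_confusion_matrix

y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([1, 1, 0, 0, 0, 2, 3, 3])

# Layout is [[TN, FP], [FN, TP]] (ordered pairs).
(tn, fp), (fn, tp) = pair_confusion_matrix(y_true, y_pred)
ri_from_pcm = (tn + tp) / (tn + fp + fn + tp)
print(ri_from_pcm, rand_score(y_true, y_pred))  # both 0.7857142857142857
```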
Exercises#
Create a clustering that splits one true cluster into two; verify how TP/FN change.
Create a clustering that merges two true clusters; verify how TN/FP change.
Simulate random labelings with different k and estimate the expected Rand score.
Implement adjusted_rand_score (ARI) from the contingency table and compare to scikit-learn.
References#
W. M. Rand (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association.
L. Hubert, P. Arabie (1985). Comparing partitions. Journal of Classification. (Adjusted Rand index)
scikit-learn docs: https://scikit-learn.org/stable/modules/clustering.html#rand-index